Customer Personality Analysis

1. Basic Statistics

Inference of Customer Statistics

1. The customer data has 2240 rows and 28 columns, with no duplicate rows.
2. The Income column has 2240 - 2216 = 24 missing values.
3. The maximum income is 666666, while the reported percentiles lie between roughly 35k and 68k (a possible outlier).
4. The maximum spending on Wines, Fruits, MeatProducts, FishProducts, SweetProducts and GoldProducts also points to possible outliers.
5. The mean number of website visits per month is 5 across the roughly 2k customers; the maximum is 20.
6. The AcceptedCmp, Complain and Response columns are binary (0 or 1).
7. The newest customer enrolled on 06/12/2020 and the oldest on 08/01/2018.
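The checks behind these statistics can be sketched roughly as follows. The small DataFrame is a made-up stand-in for the customer table, and names such as `n_duplicates` are illustrative rather than taken from the original notebook.

```python
import pandas as pd

# Toy stand-in for the customer table (the real data has 2240 rows and 28 columns).
data = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [35000.0, None, 68000.0, 666666.0],
    "Complain": [0, 0, 1, 0],
})

n_rows, n_cols = data.shape                       # table dimensions
n_duplicates = int(data.duplicated().sum())       # fully duplicated rows
missing_income = int(data["Income"].isna().sum()) # missing Income values
summary = data.describe()                         # count, mean, percentiles, min, max

# A column is binary when its only values are 0 and 1.
is_binary = set(data["Complain"].unique()) <= {0, 1}
```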

2. Data Wrangling

1. Identification of the missing values.
2. Imputation of the missing values.
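A minimal sketch of these two steps is given here. The imputation strategy is not stated in this write-up, so filling with the median is an assumption, chosen because the median is less sensitive to an extreme income than the mean.

```python
import pandas as pd

# Toy Income column with two missing values.
data = pd.DataFrame({"Income": [30000.0, None, 50000.0, 70000.0, None]})

# Step 1: count missing values per column.
missing_per_column = data.isna().sum()

# Step 2 (assumption): replace missing incomes with the column median.
income_median = data["Income"].median()
data["Income"] = data["Income"].fillna(income_median)
```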

3. Customer Analysis

4.1 Feature Engineering: New Relations

Total number of children a customer has, i.e., [Kidhome] + [Teenhome].

The dataset contains the details of customers who enrolled with the company from 2018 to 2020. Assuming that the data was collected in 2021, we calculate the Age of each customer from their [Year_Birth].

Calculating how many months a customer has been associated with the company, up to the most recent day.

Customers have spent money on various products: wines, fruits, meat products, fish products, sweet products and gold products. Summing up the entire expenditure of a customer as [total_spending].
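The four derived features above could be computed along these lines. This is a sketch on toy rows: the reference date of 2021-01-01, the `months_enrolled` name and the spending column names other than MntWines and MntMeatProducts are assumptions.

```python
import pandas as pd

data = pd.DataFrame({
    "Year_Birth": [1980, 1995],
    "Kidhome": [1, 0],
    "Teenhome": [1, 0],
    "date_parsed": pd.to_datetime(["2018-08-15", "2020-06-20"]),
    "MntWines": [100, 20],
    "MntFruits": [10, 5],
    "MntMeatProducts": [50, 15],
    "MntFishProducts": [5, 0],
    "MntSweetProducts": [5, 0],
    "MntGoldProducts": [30, 10],  # assumed column name
})

# Total children at home.
data["kids"] = data["Kidhome"] + data["Teenhome"]

# Age, assuming the data was collected in 2021.
data["Age"] = 2021 - data["Year_Birth"]

# Whole months between enrollment and an assumed reference date.
reference_date = pd.Timestamp("2021-01-01")
data["months_enrolled"] = (
    (reference_date.year - data["date_parsed"].dt.year) * 12
    + (reference_date.month - data["date_parsed"].dt.month)
)

# Total expenditure across all product categories.
spend_cols = [c for c in data.columns if c.startswith("Mnt")]
data["total_spending"] = data[spend_cols].sum(axis=1)
```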

The customers can be categorized into 4 distinct age groups with respect to their given age.

1. Teen
2. Adult
3. Middle Age Adult
4. Senior Adult
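One way to derive such groups is `pandas.cut`. The age boundaries used in the analysis are not stated here, so the bin edges below are purely hypothetical.

```python
import pandas as pd

ages = pd.Series([18, 30, 50, 70], name="Age")

# Hypothetical bin edges; intervals are right-inclusive: (0, 19], (19, 39], ...
bins = [0, 19, 39, 59, 200]
labels = ["Teen", "Adult", "Middle Age Adult", "Senior Adult"]

age_group = pd.cut(ages, bins=bins, labels=labels)
```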

The customer marital status takes many different string values, most of which fall into one of two categories. So, it can be broadly classified as:

1. Single
2. Partner 
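A sketch of this grouping follows. The raw status strings and the choice of which ones count as "Partner" are assumptions made for illustration.

```python
import pandas as pd

# Assumed raw values; the real column may contain others.
status = pd.Series(["Married", "Together", "Single", "Divorced", "Widow"])

# Assumption: only these raw values indicate a partner; everything else is Single.
partner_values = {"Married", "Together"}
relationship = status.isin(partner_values).map({True: "Partner", False: "Single"})
```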

Finding out the total family size of the customers.

Creating a new binary column to classify whether a customer is a parent or not.
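Family size and the parent flag might be derived as below. The formula (one or two adults plus the number of kids) and the `relationship` column name are assumptions consistent with the definitions given so far.

```python
import pandas as pd

data = pd.DataFrame({
    "kids": [0, 2, 1],
    "relationship": ["Partner", "Single", "Partner"],
})

# Assumption: a Single household has 1 adult, a Partner household has 2.
adults = data["relationship"].map({"Single": 1, "Partner": 2})
data["family_size"] = adults + data["kids"]

# 1 if the customer has at least one child at home, else 0.
data["Is_parent"] = (data["kids"] > 0).astype(int)
```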

The education level or degree of the customers can be mainly categorized into 3 different groups:

1. Under Graduate
2. Graduate
3. Post Graduate
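The grouping can be expressed as a dictionary lookup. The raw education labels and the way they are assigned to the three groups are assumptions; adjust the mapping to the values actually present in the data.

```python
import pandas as pd

# Assumed raw education labels.
education = pd.Series(["Basic", "2n Cycle", "Graduation", "Master", "PhD"])

# Hypothetical mapping to the three groups.
education_map = {
    "Basic": "Under Graduate",
    "2n Cycle": "Under Graduate",
    "Graduation": "Graduate",
    "Master": "Post Graduate",
    "PhD": "Post Graduate",
}
education_group = education.map(education_map)
```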

Now, we copy our dataset into a newly formed DataFrame, data_new, and discard the redundant columns from the new DataFrame:

1. Marital_Status 
2. date_parsed
3. Z_CostContact, Z_Revenue
4. Promotions and deals, i.e., [AcceptedCmp]s
5. Year_Birth
6. ID
7. Age Group, Complain and Response.
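A sketch of the copy-and-drop step on a toy table is shown here. The exact column labels (for example "Age Group") are assumed; the AcceptedCmp columns are selected by prefix.

```python
import pandas as pd

data = pd.DataFrame({
    "ID": [1, 2],
    "Year_Birth": [1980, 1995],
    "Marital_Status": ["Married", "Single"],
    "date_parsed": pd.to_datetime(["2019-01-15", "2020-03-02"]),
    "Z_CostContact": [3, 3],
    "Z_Revenue": [11, 11],
    "AcceptedCmp1": [0, 1],
    "Complain": [0, 0],
    "Response": [1, 0],
    "Age Group": ["Adult", "Adult"],
    "Income": [52000.0, 31000.0],
})

redundant = ["Marital_Status", "date_parsed", "Z_CostContact", "Z_Revenue",
             "Year_Birth", "ID", "Age Group", "Complain", "Response"]
redundant += [c for c in data.columns if c.startswith("AcceptedCmp")]

# Work on a copy so the original table stays intact.
data_new = data.copy().drop(columns=redundant)
```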

4.2 Feature Engineering: Outliers Treatment and Correlation

From the plot, it is clear that the columns Age and Income have some outliers. We will remove them.
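Removing the outliers can be done with a boolean mask. The cut-off values below are hypothetical, since the thresholds used in the analysis are not given.

```python
import pandas as pd

data = pd.DataFrame({
    "Age": [35, 50, 121, 44],
    "Income": [40000.0, 666666.0, 52000.0, 61000.0],
})

# Hypothetical thresholds for plausible ages and incomes.
mask = (data["Age"] < 90) & (data["Income"] < 600_000)
data_clean = data[mask].reset_index(drop=True)
```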

Inference of the plot

1. Correlated components (> 0.78): [Income, total_spending], [MntWines, total_spending], [MntMeatProducts, total_spending], [kids, family_size] and [kids, Is_parent].
2. Inference:
    -> High-income customers spend more.
    -> Most customers spend more on wines and meat products than on other products.
    -> Customers with kids have a larger family size and are, by definition, parents.
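Pairs above the 0.78 threshold can also be listed programmatically instead of being read off the plot. The sketch below uses toy numbers and keeps only the upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Income": [20, 40, 60, 80, 100],
    "total_spending": [5, 15, 30, 50, 70],
    "Age": [50, 30, 45, 28, 41],
})

corr = data.corr()

# Keep entries strictly above the diagonal; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs and keep the strongly correlated ones.
pairs = upper.stack()
strong_pairs = list(pairs[pairs.abs() > 0.78].index)
```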

5. Data Preprocessing

1. Encoding of the categorical variables.
2. Scaling the data, since the range varies greatly in some of the columns.
3. Applying a dimensionality reduction technique.
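The first two steps might look like this with scikit-learn. `LabelEncoder` and `StandardScaler` are assumed choices; the write-up does not say which encoder or scaler was used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

data_new = pd.DataFrame({
    "education": ["Graduate", "Post Graduate", "Under Graduate", "Graduate"],
    "Income": [30000.0, 80000.0, 20000.0, 50000.0],
    "kids": [1, 0, 2, 1],
})

# Encode every non-numeric column as integer codes (classes sorted alphabetically).
encoded = data_new.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# Standardize each column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded), columns=encoded.columns
)
```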

6. Dimensionality Reduction: Principal Component Analysis (PCA)

From the plot, it is clear that the first 5 principal components explain almost 70% of the variance. We will reduce the data to 5 components using PCA.
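The cumulative explained variance behind such a plot, and the reduction to 5 components, can be sketched as follows. The random matrix is only a stand-in for the scaled customer data, so its variance profile will differ from the one reported.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the scaled feature matrix (200 rows, 10 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Project the data onto the first 5 principal components.
pca = PCA(n_components=5, random_state=42)
X_reduced = pca.fit_transform(X)
```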

7. Clustering

1. Use the KMeans algorithm with the Elbow method to find the optimal number of clusters.
2. Create an Agglomerative Clustering model (a hierarchical clustering method).
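Both steps can be sketched on synthetic blobs. The data, the range of k and the hyperparameters are assumptions. The elbow is the value of k after which the inertia stops dropping sharply.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced customer data.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Elbow method: record the within-cluster sum of squares for each k.
inertias = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Hierarchical alternative with the chosen number of clusters.
agg_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)
```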

From the plot, the optimal number of clusters is 4. We now apply KMeans clustering.

KMeans output: (array([0, 1, 2, 3], dtype=int32), array([518, 604, 560, 554]))
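An output of this shape (cluster ids paired with cluster sizes) is what `numpy.unique` returns with `return_counts=True`. A sketch on synthetic data, with assumed hyperparameters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced customer data.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Pair each cluster id with the number of customers assigned to it.
cluster_ids, cluster_sizes = np.unique(labels, return_counts=True)
```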

8. Evaluation of the Clusters

From the plot, we can interpret that:

1. Customers in Cluster 0: High Income and High Spendings.
2. Customers in Cluster 1: Low Income and Low Spendings.
3. Customers in Cluster 2: Average Income and High Spendings.
4. Customers in Cluster 3: Average Income and Low Spendings.

Now, looking at the spending pattern of the customers across different products.

From the plot, customers in Cluster 0 are the most profitable, followed by Cluster 2 and so on.

Further, we consider the impact of promotions and campaigns on our customer clusters.

The plot shows that no one took part in all 5 campaigns. The response to the campaigns has not been good: overall, only a few customers showed interest in them.
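The campaign counts behind such a plot could be computed as below. The column names AcceptedCmp1 to AcceptedCmp5 and the `cluster` column are assumptions based on the [AcceptedCmp] columns mentioned earlier, and the rows are toy data.

```python
import pandas as pd

data = pd.DataFrame({
    "cluster":      [0, 0, 1, 2, 3],
    "AcceptedCmp1": [1, 0, 0, 0, 0],
    "AcceptedCmp2": [0, 0, 0, 0, 0],
    "AcceptedCmp3": [1, 0, 0, 1, 0],
    "AcceptedCmp4": [1, 0, 0, 0, 0],
    "AcceptedCmp5": [0, 1, 0, 0, 0],
})

# Number of campaigns each customer accepted.
campaign_cols = [f"AcceptedCmp{i}" for i in range(1, 6)]
data["total_accepted"] = data[campaign_cols].sum(axis=1)

# Customers who accepted all 5 campaigns, and acceptances per cluster.
accepted_all_five = int((data["total_accepted"] == 5).sum())
accepted_by_cluster = data.groupby("cluster")["total_accepted"].sum()
```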

Solutions:

1. Better-planned campaigns based on customer interests.
2. Specific campaigns for different types of customers.

From the plot, the deals offered attracted more customers than the promotional campaigns did. Customers in Clusters 2 and 3 seemed to spend more on deals.

We will be profiling the clusters formed to identify who is our star customer and who needs more attention from the retail store's marketing team.

We will plot some of the features that indicate customers' personality traits.

Customer Personality Inference

Customers in Cluster 0:

1. They are definitely not parents.
2. Maximum of 2 members in the family.
3. Roughly equal numbers of single and married customers, with singles slightly more common.
4. All Ages.
5. High Income.

Customers in Cluster 1:

1. Majority of them are parents.
2. Maximum 3 members in the family.
3. Relatively Younger.
4. Lower Income Group.

Customers in Cluster 2:

1. Average to high income.
2. Definitely parents.
3. Family size of minimum 2 and maximum 4.
4. Relatively Older.
5. Most have a teenager at home.

Customers in Cluster 3:

1. Definitely a Parent.
2. Relatively Older.
3. Family size of minimum 2 and maximum 5. 
4. Have both a small kid and a teenager.
5. Lower Income group.

9. Creating Production Pipeline

Pipeline-original cluster: (array([0, 1, 2, 3]), array([520, 604, 558, 554], dtype=int64))
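A production pipeline of this kind can be sketched with scikit-learn's `Pipeline`, chaining scaling, PCA and KMeans. The step names, hyperparameters and synthetic data below are assumptions, not the original notebook's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded customer features.
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=42)),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42)),
])

# Fit all steps and get a cluster label per customer.
pipeline_labels = pipeline.fit_predict(X)
ids, sizes = np.unique(pipeline_labels, return_counts=True)

# New customers go through the same scaling and projection before assignment.
new_customer_clusters = pipeline.predict(X[:5])
```

KMeans assigns cluster numbers arbitrarily, so pipeline clusters have to be matched to the original ones by their profiles rather than by their ids.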

Final Step: Testing Predictions

PIPELINE CLUSTERING ------------------------- ORIGINAL CLUSTER (MATCH ABOVE)

0: High Income and High Spendings ----------- Cluster 0: High Income and High Spendings.

1: Low Income and Low Spendings ------------- Cluster 1: Low Income and Low Spendings.

2: Average Income and High Spendings -------- Cluster 2: Average Income and High Spendings.

3: Average Income and Low Spendings --------- Cluster 3: Average Income and Low Spendings.